Introduction to Open Data Science - Course Project

About the project

This course - Introduction to Open Data Science - introduces open research tools and open data, and how to apply data-driven statistics.

Link to my project: https://github.com/SusannaSuntila/IODS-project

date()
## [1] "Mon Dec  5 13:33:04 2022"

Thoughts about the course

I am excited and I only hope that I can keep up with the pace and content of this course, but it all sounds very interesting. I have taken some courses in R, but I have not used it professionally or for very long. I want to become even more familiar with using R, and I want to deepen my skills in visualizing, sharing, wrangling, testing and analyzing data. I heard about this course in my previous course on research skills.


Learning experience with the R book and Exercise set 1

  • The first exercise was a good introduction, as I was already a bit familiar with the book, but now I could focus on the parts that were more difficult for me, or that I hadn’t really used before.
  • Simpler functions, like filter() or group_by(), are something that I will try to make use of even more.
  • In my last R course I was introduced to ggplot, and drawing plots and fine-tuning them was definitely my favorite part of this exercise. I hope to develop my skills in that area during this course.
  • Working with factors was also part of the book that I liked, and there were functions that I hadn’t used, like summary_factorlist.
  • I haven’t really used the tidy() function to tidy up the results of statistical tests, so reading these chapters was a good reminder of that option as well.
  • I haven’t done any reshaping of data formats, especially tidying from a wide to a long format. That was a more difficult part of the book for me, but it is definitely something I want to practice.
  • Different statistical tests and plotting data in order to choose the right test was also a more demanding part for me and something that I want to focus on during this course.
  • Many of the functions and approaches from the finalfit package were more difficult as well, as I haven’t used them much.
  • I like the so-called active reading, as it helps me understand better what is happening and also makes it possible to try some changes to the code myself. I think the book is clear and well constructed.
  • I have a lot to learn about using R Markdown, so this has been a good start. There have been quite a few new concepts in the first week.

Regression and model validation

date()
## [1] "Mon Dec  5 13:33:04 2022"

Read in the data and describe it

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2
## Warning: package 'ggplot2' was built under R version 4.2.2
## Warning: package 'purrr' was built under R version 4.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
# the data-set that I created
learning2014 <- read.csv(file.path(".", "data", "learning2014.csv"))
# Look at the dimensions of the data
dim(learning2014)
## [1] 166   7
# Look at the structure of the data
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

There are 166 rows (observations) and 7 columns (variables) in this data. The gender variable is a character; the other variables are numeric or integer.

The original data is from a questionnaire that tracked different learning approaches and achievement on a university-level introductory statistics course. Gender and age are self-explanatory. The points variable tells how many exam points the person got. The attitude variable is a sum of 10 questions, on a 1-5 scale, related to the student’s attitude towards statistics. The remaining variables deep, stra and surf are composite variables combined from multiple questions measuring the same dimension: each is the average of the selected questions concerning deep, surface and strategic learning.
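As a sketch of how such an averaged composite variable is formed (the toy data frame below is made up for illustration; these are not the actual questionnaire items):

```r
# illustrative only: a composite variable as the row-wise mean of related
# 1-5 scale questions (column names are made up)
q <- data.frame(q1 = c(4, 2), q2 = c(3, 3), q3 = c(5, 4))
rowMeans(q)  # one average score per student: 4 and 3
```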

Graphical overview of the data and summaries of the variables

library(ggplot2)

# set the theme white
theme_set(theme_bw())


# plot of student's attitude and points
ggplot(learning2014, aes(x = attitude, y = points, col = gender)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_colour_manual(values = c("lightpink2", "maroon")) +
  labs(title = "Student's attitude compared to exam points",
       subtitle =  "learning2014 data")
## `geom_smooth()` using formula = 'y ~ x'

library(GGally)
## Warning: package 'GGally' was built under R version 4.2.2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

# draw a scatter plot matrix of the variables in learning2014.
# [-1] excludes the first column (gender)
pairs(learning2014[-1])

# add color by gender
gender_col <- c("lightpink2", "maroon")[unclass(factor(learning2014$gender))]

pairs(learning2014[-1], pch = 19, col = gender_col)

# create a more advanced plot matrix with ggpairs()
ggpairs(learning2014, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))

# with color by gender
ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20))) +
  scale_colour_manual(values = c("lightpink2", "maroon")) +
  scale_fill_manual(values = c("lightpink2", "maroon"))

First, clearly more women than men have participated in this questionnaire. Looking at the distributions of the variables, age is concentrated below thirty, with a long tail towards the right; the skew is larger for age than for any other variable, and there is more variation among men than among women. Interestingly, for the attitude variable there is more variation among women than among men; men also have a slightly higher mean, and their distribution is more skewed. The deep variable has a longer tail on the left side, so the observations are concentrated on the higher side; the difference between genders is very small here. The stra variable has a wide distribution. The surf variable is more concentrated among women than among men. The points variable has a thick tail among lower scores, while the distribution is more concentrated among higher scores.

The largest and most significant correlation is between attitude and points. It is positive, so a better attitude is related to better points in the exam. The lowest correlation is between deep learning and exam points, which is interesting as well. The surf variable also correlates significantly with attitude and deep.

Regression model with three explanatory variables

library(GGally)
library(ggplot2)

# draw a plot of the linear relation of exam points and attitude
qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm")
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## `geom_smooth()` using formula = 'y ~ x'

# create a regression model with multiple explanatory variables
model1 <- lm(points ~ attitude + gender + stra, data = learning2014)

# print out a summary of the model
summary(model1)
## 
## Call:
## lm(formula = points ~ attitude + gender + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.7179  -3.3285   0.5343   3.7412  10.9007 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9798     2.4030   3.737 0.000258 ***
## attitude      3.5100     0.5956   5.893 2.13e-08 ***
## genderM      -0.2236     0.9248  -0.242 0.809231    
## stra          0.8911     0.5441   1.638 0.103419    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.304 on 162 degrees of freedom
## Multiple R-squared:  0.2051, Adjusted R-squared:  0.1904 
## F-statistic: 13.93 on 3 and 162 DF,  p-value: 3.982e-08

In this model the dependent variable is exam points, as instructed. The three explanatory variables that I chose are attitude, gender and stra. The summary of the model shows that only the intercept and the attitude variable are statistically significant, so attitude is the only variable strongly associated with exam points in this particular model. The coefficient on attitude, 3.5, tells that a one unit change in attitude is related to a 3.5 point increase in exam points, conditional on the other variables of this model. The gender variable is a dummy variable that has the value 1 when the person is male, so its coefficient describes the difference between the two genders: being male lowers the predicted exam points by 0.22, again conditional on the other variables of this model. The stra variable, computed from the strategic-learning questions, has a coefficient of 0.89, meaning that better strategic learning is related to higher exam points. The F-test of the hypothesis that all three coefficients are zero has a very small p-value, so it can be rejected.
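The fitted equation allows a quick sanity check. The coefficients below are copied from the model summary above; the example student (female, attitude 3, stra 3) is chosen only for illustration:

```r
# predicted exam points = intercept + 3.51*attitude - 0.22*genderM + 0.89*stra
b <- c(intercept = 8.9798, attitude = 3.5100, genderM = -0.2236, stra = 0.8911)
unname(b["intercept"] + b["attitude"] * 3 + b["genderM"] * 0 + b["stra"] * 3)
# about 22.2 points
```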

As the variables gender and stra were not statistically significant, I dropped the gender variable first in the next model.

Regression model with two explanatory variables

# create a second regression model
model2 <- lm(points ~ attitude + stra, data = learning2014)

# print out a summary of the model
summary(model2)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

The attitude variable is still highly statistically significant. It is positively related to exam points: a one unit change in attitude for the better is related to a 3.5 point increase, conditional on the stra variable. The intercept of this model means that if both attitude and stra were 0, the predicted exam points would be about 9 according to this model. Interestingly, when the gender variable is removed from the model, the stra variable becomes significant at the 10 percent level. It has a coefficient of 0.9, meaning that a one unit increase in strategic learning is related to a 0.9 point increase in exam points, conditional on the attitude variable.

The multiple R-squared is 0.20, which means that the two variables attitude and stra together explain 20 percent of the variation in exam points.

Compared to the previous model, the F-statistic has risen as well, and the p-value is smaller.

Diagnostic plots

# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
par(mfrow = c(2,2))
plot(model2,which = c(1,2,5))

The first plot, which compares the residuals with the fitted values, shows that the points are spread around quite randomly, apart from a few outliers, so the model seems reasonably appropriate. The plot does not indicate non-linear trends or non-constant variance.

The Q-Q plot shows that the residuals are approximately normally distributed. However, the lower tail is heavier than expected, so the residuals there are larger in magnitude than a normal distribution would give, and the upper tail is lighter, so the residuals there are smaller than expected.

The residuals vs leverage plot shows whether there are any outliers that would be influential for the model. The curved Cook’s distance lines don’t even appear within this plot, so the outliers aren’t too disturbing, and there are no points that have high residuals and high leverage at the same time.


Logistic regression

date()
## [1] "Mon Dec  5 13:34:06 2022"

Read the data

library(tidyverse)
# read in the data that was created
alc <- read.csv(file.path(".", "data", "alc.csv"))

# print names of the variables
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

This data set is constructed from a questionnaire for secondary school students in Portuguese schools. As the names of the variables indicate, the questions cover topics such as school and studies, but also social and demographic aspects. The alc data set was combined from two data sets on the students’ performance in math and in the Portuguese language.

Relationships between high/low alcohol consumption

The binary variable address is an interesting one: I would predict that high alcohol use is more probable for students living in urban areas, as there are more opportunities to use alcohol than for students living in rural areas.

Another variable that I want to look at is studytime (weekly study time, divided into 4 groups, where 4 is the highest amount of time spent studying), to see if spending more time studying is negatively related to high alcohol use, as such a student spends more time on studies and is possibly more committed to school.

I would also hypothesize that the goout variable, telling how much the student goes out with friends (on a 1-5 scale where 5 is very high), is positively linked to high alcohol use, as drinking is often a social activity.

Lastly I wanted to include the more continuous variable absences, which measures the number of school absences (0-93). I would hypothesize that more absences are positively linked to high alcohol use, as drinking a lot might affect the student’s attendance at school.

Explore variables’ distributions and relationships with alcohol consumption

First the address variable

library(ggplot2)

# set the background white
theme_set(theme_bw())

# bar plots of address variable
g1 <-  ggplot(data = alc, aes(x = address, fill = high_use)) + 
  geom_bar() +
  theme(legend.position = "none") +
  scale_fill_manual(values = c("paleturquoise3", "paleturquoise4"))

g2 <- ggplot(data = alc, aes(x = address, fill = high_use)) + 
  geom_bar(position = "fill") +
  ylab("proportion") +
  scale_fill_manual(values = c("paleturquoise3", "paleturquoise4"))

# combine the two plots
library(patchwork)
g1 | g2

As can be expected, most of the students live in urban areas, so in absolute terms more high-use students are found in urban areas. But contrary to what I expected, the proportion of students with high alcohol use is larger in rural areas.

Next the studytime variable:

# studytime variable
g3 <-  ggplot(data = alc, aes(x = studytime, fill = high_use)) + 
  geom_bar() +
  theme(legend.position = "none") +
  scale_fill_manual(values = c("plum3", "plum4"))

g4 <- ggplot(data = alc, aes(x = studytime, fill = high_use)) + 
  geom_bar(position = "fill") +
  ylab("proportion") +
  scale_fill_manual(values = c("plum3", "plum4"))

# combine the two plots
library(patchwork)
g3 | g4

As I thought, students with less study time have a higher rate of high alcohol use. This is clear both in absolute numbers and proportionally.

Next the goout variable:

# the going out variable

g5 <-  ggplot(data = alc, aes(x = goout, fill = high_use)) + 
  geom_bar() +
  theme(legend.position = "none") +
  scale_fill_manual(values = c("palevioletred3", "palevioletred4"))

g6 <- ggplot(data = alc, aes(x = goout, fill = high_use)) + 
  geom_bar(position = "fill") +
  ylab("proportion") +
  scale_fill_manual(values = c("palevioletred3", "palevioletred4"))

# combine the two plots
library(patchwork)
g5 | g6

Here the high use of alcohol increases with the amount of going out with friends, and the pattern is very clear.

And lastly the absences variable:

# absences
g7 <- ggplot(data = alc, aes(x = absences, fill = high_use)) + 
  geom_bar() +
  theme(legend.position = "none") +
  scale_fill_manual(values = c("pink3", "pink4"))

g8 <- ggplot(data = alc, aes(x = absences, fill = high_use)) + 
  geom_bar(position = "fill") +
  scale_fill_manual(values = c("pink3", "pink4")) +
  ylab("proportion")

# combine the two plots
library(patchwork)
g7 | g8

Here it is also clear that an increase in absences is linked to an increase in high alcohol use, but there are also some outliers: a few students have a very high number of absences.

Looking at the absences variable closer with a box plot:

# box plot of absences with color by address
ggplot(data = alc, aes(x = high_use, y = absences, col = address)) + 
  geom_boxplot() +
  scale_color_manual(values = c("violetred", "violetred4"))

The median of absences is the same whether the student is from a rural or an urban area when there is no high use of alcohol, but the median is slightly higher for students from rural areas in the high-use group.

Next, a few cross-tabulations. First I wanted to add a table that includes address, the mean of going out and, of course, high use of alcohol.

# table of address, high use of alcohol and the mean of going out
alc %>% group_by(address, high_use) %>% summarise(count = n(), mean_goout = mean(goout))
## `summarise()` has grouped output by 'address'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 4
## # Groups:   address [2]
##   address high_use count mean_goout
##   <chr>   <lgl>    <int>      <dbl>
## 1 R       FALSE       50       2.6 
## 2 R       TRUE        31       3.52
## 3 U       FALSE      209       2.91
## 4 U       TRUE        80       3.81

The mean of going out increases quite a lot when the alcohol use is high for both rural and urban area students. The mean of going out in both cases of alcohol use is higher for students who live in urban areas compared to rural area students, as would be expected.

Another cross-tabulation of the mean of studytime and mean of absences with regards to alcohol use:

# table of high use of alcohol, the mean of studytime and the mean of absences
alc %>% group_by(high_use) %>% summarise(count = n(), mean_studytime = mean(studytime), mean_absences = mean(absences))
## # A tibble: 2 × 4
##   high_use count mean_studytime mean_absences
##   <lgl>    <int>          <dbl>         <dbl>
## 1 FALSE      259           2.16          3.71
## 2 TRUE       111           1.77          6.38

Those students who have high use of alcohol also have a lower mean studytime and a higher mean of absences compared to those who do not.

Logistic regression of the chosen variables

# simple model with only the absences as an explanatory variable
model <- glm(high_use ~ absences, data = alc, family = "binomial")

#summary of the simple model
summary(model)
## 
## Call:
## glm(formula = high_use ~ absences, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3598  -0.8209  -0.7318   1.2658   1.7419  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.26945    0.16094  -7.888 3.08e-15 ***
## absences     0.08867    0.02317   3.827  0.00013 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 434.52  on 368  degrees of freedom
## AIC: 438.52
## 
## Number of Fisher Scoring iterations: 4
####

# model with all the chosen variables: absences, studytime, goout and address
model1 <- glm(high_use ~ absences + studytime + goout + address, data = alc, family = "binomial")

# print out a summary of the model
summary(model1)
## 
## Call:
## glm(formula = high_use ~ absences + studytime + goout + address, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9982  -0.7761  -0.4950   0.8461   2.3232  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.92997    0.56532  -3.414 0.000640 ***
## absences     0.06687    0.02221   3.011 0.002600 ** 
## studytime   -0.59099    0.16862  -3.505 0.000457 ***
## goout        0.75392    0.12061   6.251 4.09e-10 ***
## addressU    -0.73038    0.30408  -2.402 0.016308 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 371.44  on 365  degrees of freedom
## AIC: 381.44
## 
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(model1)
## (Intercept)    absences   studytime       goout    addressU 
## -1.92997318  0.06687033 -0.59099384  0.75391575 -0.73038272
# compute odds ratios (OR)
OR <- coef(model1) %>% exp

# compute confidence intervals (CI)
CI <- confint(model1) %>% exp()
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                    OR     2.5 %    97.5 %
## (Intercept) 0.1451521 0.0466309 0.4305313
## absences    1.0691568 1.0250338 1.1197230
## studytime   0.5537766 0.3936409 0.7639234
## goout       2.1253059 1.6882118 2.7118282
## addressU    0.4817246 0.2648416 0.8757487

I first looked at a simple model, where absences was the only explanatory variable; its AIC was 438.52. After adding the three other variables, the AIC decreased to 381.44, so adding them has improved the model.

When looking at the summary of the fitted model with all the chosen variables, all the coefficients are statistically significant at the 5% level. As the earlier exploration of the variables implied, the relationship is positive for absences and goout, so an increase in these variables increases the odds of the student having high use of alcohol. Studytime and address, on the other hand, decrease the odds of high alcohol use.

Looking at the coefficients more closely, absences has an odds ratio of 1.069, meaning that a one unit increase in absences increases the odds of the student having high use of alcohol by 6.9%, keeping the other variables constant; the 95% confidence interval is a 2.5% to 12% increase. This isn’t as large an effect as I might have expected.

Studytime has an odds ratio below one, 0.55, so a one unit increase in studytime decreases the odds of the student having high use of alcohol by 45% when the other variables are constant; the 95% CI is a 24% to 61% decrease in the odds. So studytime works in the opposite direction to absences.

The goout variable has an odds ratio of 2.125, meaning that a one unit increase in goout increases the odds of the student having high use of alcohol by 112.5%, again adjusting for the other variables in the model; the 95% CI is a 69% to 171% increase. So going out has quite a large effect, which was already apparent in the plots at the beginning.

The address variable has an odds ratio of 0.48, meaning that living in an urban area decreases the odds of the student having high use of alcohol by 52%, adjusting for the other variables; the 95% CI is a 13% to 74% decrease. This is quite surprising, as in the beginning I wasn’t even sure of the direction of the effect.
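As a consistency check, the odds ratios above are simply the exponentiated log-odds coefficients (the values below are copied from the model summary):

```r
# exp() turns the log-odds coefficients into odds ratios
exp(c(absences = 0.06687, studytime = -0.59099, goout = 0.75392,
      addressU = -0.73038))
# roughly 1.069, 0.554, 2.125 and 0.482, matching the OR column above
```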

Explore the predictive power of the model

First a 2x2 cross tabulation of predictions versus the actual values:

# fit the model
model1 <- glm(high_use ~ absences + studytime + goout + address, data = alc, family = "binomial")

# predictions
library(dplyr)
alc <- mutate(alc, probability = predict(model1, type = "response"))
alc <- mutate(alc, prediction = probability > 0.5)


# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   237   22
##    TRUE     63   48
# initialize a plot of 'high_use' versus 'probability' in 'alc'
ggplot(alc, aes(x = probability, y = high_use, col = prediction)) +
  geom_point() +
  scale_color_manual(values = c("skyblue2", "skyblue4"))

This model incorrectly predicts high use for 22 students (false positives) and incorrectly predicts low use for 63 students (false negatives).

Next, the total proportion of inaccurately classified individuals, i.e. the training error.

# define a loss function (mean prediction error)

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2297297

The training error is 0.2297, so on average about 23% of the predictions are wrong. Since the training error is well below 0.5, the model is better than random guessing.
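The same number can be read straight from the confusion matrix above: misclassified observations divided by the total.

```r
# (false positives + false negatives) / all students
(22 + 63) / (237 + 22 + 63 + 48)
# [1] 0.2297297
```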

Perform 10-fold cross-validation on your model

#K-fold cross-validation, K=10
library(boot)
## Warning: package 'boot' was built under R version 4.2.2
cv <- cv.glm(data = alc, cost = loss_func, glmfit = model1, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2405405

One run of the 10-fold cross-validation gives a prediction error of about 0.24 (the exact value varies between runs, as the folds are assigned randomly), so roughly 24% of the predictions are wrong. This is slightly smaller than the error of the model in the exercise set, 0.26, so the model that I have used has a bit better test set performance.


Clustering and classification

Load the data

# access the MASS package
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:patchwork':
## 
##     area
## The following object is masked from 'package:dplyr':
## 
##     select
# load the data
data("Boston")

# explore the dataset
dim(Boston)
## [1] 506  14
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

There are 506 observations and 14 variables in this data set from the MASS package. The data consists of housing values in suburbs of Boston; for example, the variable crim gives the per capita crime rate by town.

Graphical overview of the data

library(GGally)
library(ggplot2)
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.2.2
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(tidyr)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.2.2
## corrplot 0.92 loaded
#summary of the variables
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# draw a scatter plot matrix of the variables
pairs(Boston)

# the distribution of all the variables
ggplot(data = melt(Boston), aes(x = value)) + 
stat_density() + 
facet_wrap(~variable, scales = "free")
## No id variables; using all as measure variables

# correlations between the variables
cor_matrix <- cor(Boston) %>% 
  round(., digits = 2)

corrplot.mixed(cor_matrix)

The summaries of the variables show that the scales vary a lot between variables.

Looking at the distributions of the variables, only rm (average number of rooms per dwelling) is close to a normal distribution, and medv (median value of owner-occupied homes in $1000s) is somewhat normal. The variable of interest, crim, has most of its observations concentrated at low values with a long right tail, as does zn (proportion of residential land zoned for lots over 25,000 sq.ft). The reverse is true for black (1000(Bk − 0.63)^2, where Bk is the proportion of blacks by town). A few variables have two peaks in their distribution: indus (proportion of non-retail business acres per town), rad (index of accessibility to radial highways) and tax (full-value property-tax rate per $10,000). The rest of the variables are skewed to the left or right.

Finally, looking at the correlation plot, crim correlates (positively) the most with rad. The largest correlations overall are between nox and dis (-0.77) and between rad and tax (0.91). It seems that indus, nox, dis, rad and tax correlate the most with the other variables, while chas (the Charles River dummy variable) doesn’t really correlate with anything.

Standardizing data, creating a categorical variable and forming a train and test set

First standardize the data set

# use the scale function
boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crim <- as.numeric(boston_scaled$crim)

# summaries of the scaled data
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

The initial data had a very wide range of values for each variable, and the scales varied a lot depending on the variable, so standardizing has normalized the range of the values.
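As a sanity check, scale() does nothing more than subtract each column's mean and divide by its standard deviation; a minimal base-R sketch on a toy vector (standing in for one Boston column):

```r
# toy vector standing in for one Boston column
x <- c(2, 4, 6, 8, 10)

# scale() centers the values and divides by the sample standard deviation
scaled_auto <- as.numeric(scale(x))

# the same transformation done by hand
scaled_manual <- (x - mean(x)) / sd(x)

all.equal(scaled_auto, scaled_manual)  # TRUE
mean(scaled_auto)                      # 0
sd(scaled_auto)                        # 1
```

After scaling, every column has mean 0 and standard deviation 1, which is what makes the summaries above comparable across variables.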

Next, create a factor variable from the crim variable:

# create a quantile vector of crim
bins <- quantile(boston_scaled$crim)

# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, labels = c("low", "med_low", "med_high", "high"), include.lowest = TRUE)

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

And finally the creation of train and test sets:

# Divide the dataset into train and test sets, so that 80% of the data belongs to the train set
ind <- sample(nrow(boston_scaled),  size = nrow(boston_scaled) * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
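Note that sample() draws the row indices at random, so the split (and everything downstream of it) changes between runs; fixing the seed first makes it reproducible. A minimal sketch with a hypothetical seed and a stand-in row count:

```r
set.seed(123)  # any fixed seed makes the split reproducible

n <- 100                           # stand-in for nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)   # 80% of the row indices go to the train set

train_rows <- ind
test_rows  <- setdiff(seq_len(n), ind)  # the remaining 20%

length(train_rows)  # 80
length(test_rows)   # 20
```

Indexing with `ind` and `-ind`, as above, guarantees that no observation ends up in both sets.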

Fit the linear discriminant analysis on the train set

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2450495 0.2450495 0.2599010 0.2500000 
## 
## Group means:
##                   zn      indus        chas        nox          rm        age
## low       1.01172718 -0.8752106 -0.15302300 -0.8874746  0.44812536 -0.9049706
## med_low  -0.06821363 -0.3100543 -0.07348562 -0.5812391 -0.15431201 -0.4138586
## med_high -0.37943482  0.1899919  0.25261763  0.4214640  0.07438027  0.3628787
## high     -0.48724019  1.0149946 -0.07742312  1.0257316 -0.39507274  0.8142478
##                 dis        rad        tax    ptratio       black       lstat
## low       0.8580458 -0.6895341 -0.7365508 -0.4399664  0.37422872 -0.75561477
## med_low   0.3886425 -0.5422055 -0.4711655 -0.0961035  0.34611912 -0.13633383
## med_high -0.3686524 -0.3934186 -0.2978517 -0.2821187  0.09459946  0.01790898
## high     -0.8462596  1.6596029  1.5294129  0.8057784 -0.76093304  0.83384809
##                  medv
## low       0.531735419
## med_low   0.007086304
## med_high  0.175785494
## high     -0.665419786
## 
## Coefficients of linear discriminants:
##                 LD1          LD2          LD3
## zn       0.10502396  0.743376816 -1.060630293
## indus   -0.02391410 -0.074874169  0.128933780
## chas    -0.08008458 -0.093444926  0.002019129
## nox      0.25934712 -0.888365137 -1.248399434
## rm      -0.11781225 -0.052729890 -0.204580039
## age      0.29243785 -0.205407647 -0.102293784
## dis     -0.11955212 -0.276198749  0.289138509
## rad      3.20784133  0.985175881 -0.222045406
## tax      0.01764723 -0.136767309  0.826994255
## ptratio  0.13587393  0.003572737 -0.409120356
## black   -0.15469217 -0.014850093  0.141581539
## lstat    0.16364684 -0.256671325  0.355169818
## medv     0.16439764 -0.404213862 -0.177042325
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9467 0.0394 0.0140
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "navy", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, col = classes, pch = classes, dimen = 2)
lda.arrows(lda.fit, myscale = 2)

Predict the classes with the LDA model on the test data

# save the crime categories from the test set
# then remove the categorical crime variable from the test dataset
correct_classes <- test$crime
test <- dplyr::select(test, -crime)

Next predict the classes with the LDA model on the test data and cross tabulate the results with the crime categories from the test set.

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       16      11        1    0
##   med_low    2      18        7    0
##   med_high   0       2       19    0
##   high       0       0        1   25

The LDA model has predicted almost all of the observations in the high class correctly. At the lower end, a number of observations have been predicted as med_low instead of the correct low class. There is more variability in the predictions for the middle classes: of the observations predicted as med_low, almost as many actually belong to the low class as to med_low. So the LDA model has more difficulty predicting the observations in the middle than at the tails.
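The overall accuracy can be read off such a cross tabulation as the share of observations on the diagonal. A small sketch using the counts shown above (they vary with the random split):

```r
# the cross tabulation above, rebuilt as a matrix (rows = correct, cols = predicted)
conf <- matrix(c(16, 11,  1,  0,
                  2, 18,  7,  0,
                  0,  2, 19,  0,
                  0,  0,  1, 25),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))

# overall accuracy: correctly predicted observations divided by all observations
accuracy <- sum(diag(conf)) / sum(conf)
round(accuracy, 3)  # 0.765
```

The same calculation works directly on the table object returned by table(correct = ..., predicted = ...).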

Investigate what is the optimal number of clusters

Reload the Boston dataset and standardize the dataset:

library(MASS)
data("Boston")

# use the scale function
boston_scaled2 <- as.data.frame(scale(Boston))
boston_scaled2$crim <- as.numeric(boston_scaled2$crim)

Calculate the distances between the observations

# euclidean distance matrix
dist_eu <- dist(boston_scaled2)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled2, method = "manhattan")

# look at the summary of the distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618
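The two metrics measure distance differently, which explains why the Manhattan summaries above are systematically larger: Euclidean distance is the straight line between observations, Manhattan the sum of absolute coordinate differences. A minimal sketch on two points:

```r
# two observations with two (already scaled) variables
m <- rbind(a = c(0, 0),
           b = c(3, 4))

dist(m)                        # Euclidean: sqrt(3^2 + 4^2) = 5
dist(m, method = "manhattan")  # Manhattan: |3| + |4| = 7
```

Because Manhattan distance never takes the diagonal shortcut, it is always at least as large as the Euclidean distance between the same pair of points.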

Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again:

set.seed(17)

# k-means clustering
km <- kmeans(boston_scaled2, centers = 3)

# plot the Boston dataset with clusters
pairs(boston_scaled2, col = km$cluster)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

# optimal number of clusters seems to be 2, as there is a large drop at that point.
# k-means clustering with two clusters
km <- kmeans(boston_scaled2, centers = 2)

# plot the Boston dataset with clusters
pairs(boston_scaled2, col = c("steelblue3", "sienna3")[km$cluster])

Looking at the pairs plot, one of the clusters has consistently absorbed the observations at one end of each variable, so two clusters looks quite appropriate when comparing the plots of all the variables against each other.
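The elbow logic used above can be checked on toy data: with two well-separated groups, the total within-cluster sum of squares drops sharply when moving from one cluster to two, and only slowly after that. A sketch with synthetic one-dimensional data:

```r
set.seed(1)
# two clearly separated groups in one dimension
toy <- data.frame(x = c(rnorm(50, mean = 0), rnorm(50, mean = 10)))

# total WCSS for k = 1..4 (nstart restarts guard against bad random starts)
twcss_toy <- sapply(1:4, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)

# the big drop happens between k = 1 and k = 2
twcss_toy[1] / twcss_toy[2]  # far greater than 1
```

The `nstart` argument is worth using on the Boston data too, since a single random initialization can land in a poor local optimum.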

Perform k-means on the original Boston data with some reasonable number of clusters

library(MASS)
data("Boston")

set.seed(19)

# use the scale function to standardize the data
boston_scaled2 <- as.data.frame(scale(Boston))
boston_scaled2$crim <- as.numeric(boston_scaled2$crim)


# k-means clustering with 4 clusters
km <- kmeans(boston_scaled2, centers = 4)

# linear discriminant analysis
lda.fit2 <- lda(km$cluster ~ ., data = boston_scaled2)

# print the lda.fit object
lda.fit2
## Call:
## lda(km$cluster ~ ., data = boston_scaled2)
## 
## Prior probabilities of groups:
##          1          2          3          4 
## 0.39920949 0.31818182 0.05731225 0.22529644 
## 
## Group means:
##         crim         zn        indus        chas          nox         rm
## 1 -0.3554389 -0.4039269 -0.004726028  0.11748284  0.009509773 -0.2448635
## 2 -0.4071151  0.9395565 -0.951448942 -0.12560484 -0.934654735  0.6775631
## 3  3.0022987 -0.4872402  1.014994623 -0.27232907  1.059334474 -1.3064650
## 4  0.4410309 -0.4872402  1.093886784  0.03849464  1.033664373 -0.1906821
##          age       dis        rad        tax     ptratio      black       lstat
## 1  0.3085558 -0.225942 -0.5764965 -0.5036322 -0.09996773  0.2544955  0.08303063
## 2 -1.0699238  1.023927 -0.5966711 -0.6791796 -0.53145201  0.3578780 -0.86641188
## 3  0.9805356 -1.048472  1.6596029  1.5294129  0.80577843 -1.1906614  1.87087595
## 4  0.7148590 -0.779002  1.4419986  1.4625319  0.72271650 -0.6534850  0.60056774
##          medv
## 1 -0.09324226
## 2  0.72802972
## 3 -1.31020021
## 4 -0.52966703
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## crim     0.35322461  0.307758951 -1.58997833
## zn       0.09849662 -0.330566750 -0.14199716
## indus    0.27720773 -0.031523290  0.36923953
## chas     0.03859298  0.112548411  0.06821859
## nox      0.07110504 -0.230199232  0.17319460
## rm      -0.27410931 -0.568297722  0.31638241
## age      0.16778369  0.815075738  0.30710945
## dis     -0.28454054 -0.606949418 -0.07519390
## rad      1.75298697 -0.745321060  0.47791025
## tax      1.03317493 -0.661045609  0.17536769
## ptratio  0.16021535 -0.003247889  0.05098139
## black   -0.19237795  0.050874618  0.06252285
## lstat    0.27297507  0.032086428 -0.60051563
## medv     0.09553734 -0.204714316 -0.54447145
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.8386 0.0934 0.0680
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "navy", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(km$cluster)

# plot the lda results
plot(lda.fit2, col = classes, pch = classes, dimen = 2)
lda.arrows(lda.fit2, myscale = 3)

There are 4 distinct clusters that are somewhat separate, but the 4th cluster has some observations in the middle of the plot that do not clearly belong to any cluster. Clusters 1 and 2 are close to each other, as are clusters 3 and 4.

It appears that rad, tax, age, dis and rm have the highest influence in separating the clusters. Looking at the arrows, tax and rad seem to explain more of cluster 4, and rm and dis more of cluster 2, for example.

Create a 3D plot

# run the given code 

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

library(plotly)
## Warning: package 'plotly' was built under R version 4.2.2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
#first plot
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
# target classes as numeric
classes <- as.numeric(train$crime)

# color by classes
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = classes)
# k-means clustering
km <- kmeans(model_predictors, centers = 4)

# target classes as numeric
classes2 <- as.numeric(km$cluster)

# color by km clusters
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = classes2)

All the plots show the same observations, with most of them in one bigger group and the rest in a separate smaller group. In the second plot, where the colors are determined by the crime classes, the separate smaller group belongs to a single crime class. The third plot, colored by the k-means clusters, also has the same smaller group in one cluster, but the boundaries between the clusters differ somewhat from those between the crime classes.


Dimensionality reduction techniques

Graphical overview of the data

# read in the created dataset
human <- read.table(file.path(".", "data", "human.txt"))

# summaries of the variables
summary(human)
##     Edu2.FM          Labo.FM          Life.Exp        Edu.Exp     
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
# visualize the "human" variables
library(GGally)
ggpairs(human)

# compute the correlation matrix and visualize it with corrplot
library(corrplot)
cor(human) %>% corrplot()

First, looking at the summaries of the variables, the scale varies a lot depending on the variable. The gross national income variable, GNI, has the largest scale, while the ratios of females to males in secondary education (Edu2.FM) and in labor force participation (Labo.FM) have the smallest scales.

The distributions also differ somewhat. The Edu2.FM and Edu.Exp variables have fairly normal distributions, whereas GNI, Mat.Mor and Ado.Birth have most of their observations at the left tail. The rest of the variables are slightly skewed.

All of the variables have some statistically significant correlations with each other, which is necessary for principal component analysis. The variables Parli.F and Labo.FM correlate the least with the other variables. Mat.Mor and Ado.Birth correlate negatively with the other variables where the correlation is significant. The rest of the variables correlate mostly positively with each other.
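The significance of any single correlation can be checked with cor.test(), which is what the stars in the ggpairs plot summarize. A minimal sketch on two toy vectors (any pair of human variables would be used the same way):

```r
set.seed(2)
x <- rnorm(100)
y <- 0.8 * x + rnorm(100, sd = 0.5)  # built to correlate with x

ct <- cor.test(x, y)
ct$estimate   # sample correlation coefficient
ct$p.value    # small p-value -> correlation is statistically significant
```

For the human data this would be, e.g., cor.test(human$Edu.Exp, human$Life.Exp).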

PCA on the raw (non-standardized) data

# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)
summary(pca_human)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000
# how much variance each dimension explains
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.2.2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_eig(pca_human, main = "Scree plot: raw data")

# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("darkslategray4", "darkslategrey"), sub = "Large scale of gross national income weighs against all else")
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

PCA on the standardized data

# standardize the variables
human_std <- scale(human)

# print out summaries of the standardized variables
summary(human_std)
##     Edu2.FM           Labo.FM           Life.Exp          Edu.Exp       
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7188   Min.   :-2.7378  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6425   1st Qu.:-0.6782  
##  Median : 0.3503   Median : 0.2316   Median : 0.3056   Median : 0.1140  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.6717   3rd Qu.: 0.7126  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 1.4218   Max.   : 2.4730  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
# perform principal component analysis (with the SVD method)
pca_human_std <- prcomp(human_std)
summary(pca_human_std)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000
# proportion of variance explained
library(factoextra)
fviz_eig(pca_human_std, main = "Scree plot: standardized data")

# draw a biplot of the principal component representation and the original variables
biplot(pca_human_std, choices = 1:2, cex = c(0.8, 1), col = c("darkslategray4", "darkslategrey"), sub = "1: Human development vs maternal and reproductive inequalities, 2: Female participation")

First, looking at the standardized variables, where all of them have mean 0, GNI still has the largest maximum value, but its large standard deviation has now been scaled away.

Looking at the raw data, the first principal component (PC1) captures 99.99% of the variance and PC2 only 0.01%. Essentially all of the variance is captured by the first two PCs, and almost entirely by PC1. In the standardized data, by contrast, the first principal component captures 53.61% of the variance and PC2 16.24%, so the result is not dominated by a single PC; cumulatively the first two capture almost 70% of the variance. The standard deviations are also moderate, about 2 for PC1, compared to the raw data, where the standard deviation of PC1 was very large, about 18540, almost exactly the standard deviation of GNI.

Looking at the biplot of the raw data, it is clear that GNI dominates PC1: its effect overwhelms all the other variables. As noted when looking at the summaries, GNI had a much larger scale than any other variable. With the standardized data, the other variables carry more weight, since differing scales no longer dominate.
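This scale effect is easy to reproduce on toy data: two uncorrelated variables, one with a thousandfold larger scale, give a PCA where the large-scale variable absorbs essentially all the variance unless the data are standardized first. A sketch:

```r
set.seed(3)
# two uncorrelated variables on wildly different scales (like GNI vs Edu2.FM)
toy <- data.frame(small = rnorm(100, sd = 1),
                  large = rnorm(100, sd = 1000))

# raw data: the large-scale variable dominates PC1 almost completely
raw_var <- summary(prcomp(toy))$importance["Proportion of Variance", ]

# standardized data: the variance splits roughly evenly between the PCs
std_var <- summary(prcomp(toy, scale. = TRUE))$importance["Proportion of Variance", ]

raw_var  # PC1 close to 1
std_var  # both PCs close to 0.5
```

Using prcomp(..., scale. = TRUE) is equivalent to running prcomp on scale(human), as done above.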

Personal interpretations

For PC1, two groups of variables have an impact: on one side the two variables that correlate negatively with the others, Ado.Birth and Mat.Mor, and on the other side Edu2.FM, Life.Exp, Edu.Exp and GNI. These four correlate positively with each other, so they are all grouped on the left side. This dimension is therefore about human development versus maternal and reproductive inequalities.

For PC2, the two variables that hardly correlated with any other variable, Labo.FM and Parli.F, form a third group. This dimension is therefore about female participation in government and in the labor force compared to men.

Multiple Correspondence Analysis (MCA) on the tea data

# read in the data
library(dplyr)
library(tidyr)
library(ggplot2)
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)

View(tea)
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36
# column names to keep in the dataset
keep_columns <- c("breakfast", "tea.time","lunch", "dinner", "evening", "frequency", "home", "work", "tearoom")

# select the 'keep_columns' to create a new dataset
tea_set <- select(tea, one_of(keep_columns))

# look at the summaries and structure of the data
summary(tea_set)
##          breakfast           tea.time         lunch            dinner   
##  breakfast    :144   Not.tea time:131   lunch    : 44   dinner    : 21  
##  Not.breakfast:156   tea time    :169   Not.lunch:256   Not.dinner:279  
##                                                                         
##                                                                         
##         evening          frequency         home           work    
##  evening    :103   +2/day     :127   home    :291   Not.work:213  
##  Not.evening:197   1 to 2/week: 44   Not.home:  9   work    : 87  
##                    1/day      : 95                                
##                    3 to 6/week: 34                                
##         tearoom   
##  Not.tearoom:242  
##  tearoom    : 58  
##                   
## 
str(tea_set)
## 'data.frame':    300 obs. of  9 variables:
##  $ breakfast: Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ lunch    : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner   : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ evening  : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ frequency: Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ home     : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work     : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom  : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
# visualize the dataset
library(ggplot2)
pivot_longer(tea_set, cols = everything()) %>% 
  ggplot(aes(value)) + facet_wrap("name", scales = "free") +
  geom_bar(fill = "deeppink4") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

There are 300 observations and 36 variables in this data set. I chose the following columns: breakfast, tea.time, lunch, dinner and evening for the different times to drink tea, frequency for how often tea is drunk, and finally home, work and tearoom for a few of the places where tea is drunk. All of these variables are factors; most have just two levels (yes, no), but the frequency variable has 4 levels. Breakfast and tea.time have fairly evenly divided responses, but dinner, home, lunch and tearoom are quite concentrated on one side.

library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.2.2
mca <- MCA(tea_set, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_set, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.204   0.158   0.132   0.121   0.114   0.100   0.089
## % of var.             16.727  12.958  10.807   9.924   9.334   8.181   7.314
## Cumulative % of var.  16.727  29.685  40.492  50.416  59.750  67.931  75.246
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.088   0.085   0.069   0.061
## % of var.              7.188   6.992   5.611   4.963
## Cumulative % of var.  82.434  89.426  95.037 100.000
## 
## Individuals (the 10 first)
##                  Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## 1             |  0.234  0.090  0.083 | -0.644  0.874  0.625 | -0.131  0.044
## 2             |  0.234  0.090  0.083 | -0.644  0.874  0.625 | -0.131  0.044
## 3             |  0.170  0.047  0.012 |  0.308  0.200  0.040 |  0.224  0.126
## 4             |  1.000  1.631  0.473 | -0.369  0.286  0.064 |  0.055  0.008
## 5             | -0.132  0.028  0.024 | -0.065  0.009  0.006 | -0.520  0.681
## 6             |  1.000  1.631  0.473 | -0.369  0.286  0.064 |  0.055  0.008
## 7             | -0.070  0.008  0.004 | -0.120  0.030  0.012 | -0.288  0.209
## 8             |  0.307  0.154  0.082 |  0.658  0.910  0.376 |  0.029  0.002
## 9             | -0.311  0.158  0.186 | -0.327  0.225  0.206 | -0.141  0.050
## 10            | -0.394  0.253  0.133 |  0.112  0.027  0.011 | -0.150  0.057
##                 cos2  
## 1              0.026 |
## 2              0.026 |
## 3              0.021 |
## 4              0.001 |
## 5              0.369 |
## 6              0.001 |
## 7              0.067 |
## 8              0.001 |
## 9              0.039 |
## 10             0.019 |
## 
## Categories (the 10 first)
##                   Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## breakfast     |  -0.604   9.505   0.336 -10.028 |  -0.535   9.627   0.264
## Not.breakfast |   0.557   8.773   0.336  10.028 |   0.494   8.886   0.264
## Not.tea time  |   0.626   9.290   0.303   9.525 |  -0.179   0.979   0.025
## tea time      |  -0.485   7.201   0.303  -9.525 |   0.139   0.759   0.025
## lunch         |  -0.897   6.419   0.138  -6.433 |   0.960   9.480   0.158
## Not.lunch     |   0.154   1.103   0.138   6.433 |  -0.165   1.629   0.158
## dinner        |   1.819  12.586   0.249   8.629 |  -0.039   0.007   0.000
## Not.dinner    |  -0.137   0.947   0.249  -8.629 |   0.003   0.001   0.000
## evening       |  -0.250   1.170   0.033  -3.131 |   0.825  16.378   0.356
## Not.evening   |   0.131   0.612   0.033   3.131 |  -0.431   8.563   0.356
##                v.test     Dim.3     ctr    cos2  v.test  
## breakfast      -8.883 |  -0.106   0.454   0.010  -1.762 |
## Not.breakfast   8.883 |   0.098   0.419   0.010   1.762 |
## Not.tea time   -2.722 |  -0.409   6.155   0.130  -6.232 |
## tea time        2.722 |   0.317   4.771   0.130   6.232 |
## lunch           6.881 |  -0.707   6.168   0.086  -5.069 |
## Not.lunch      -6.881 |   0.122   1.060   0.086   5.069 |
## dinner         -0.184 |   0.378   0.842   0.011   1.794 |
## Not.dinner      0.184 |  -0.028   0.063   0.011  -1.794 |
## evening        10.310 |  -0.335   3.239   0.059  -4.187 |
## Not.evening   -10.310 |   0.175   1.694   0.059   4.187 |
## 
## Categorical variables (eta2)
##                 Dim.1 Dim.2 Dim.3  
## breakfast     | 0.336 0.264 0.010 |
## tea.time      | 0.303 0.025 0.130 |
## lunch         | 0.138 0.158 0.086 |
## dinner        | 0.249 0.000 0.011 |
## evening       | 0.033 0.356 0.059 |
## frequency     | 0.427 0.501 0.220 |
## home          | 0.052 0.058 0.199 |
## work          | 0.124 0.000 0.246 |
## tearoom       | 0.177 0.063 0.227 |
# visualize MCA
plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali")

Looking at the summary, the first two dimensions explain about 30% of the variance. The bar plots of contributions to dimensions 1-2 (below this text) show that the variables breakfast and frequency (levels 1 to 2/week and 1/day) are at the top. The four levels that contribute the least are Not.lunch, Not.work, Not.dinner and home; these belong to the variables where most of the answers fall on one of the two levels. As the dinner level, for example, had very few people, it lies very far from the center of the MCA factor map on the first dimension.

# another useful package
library(factoextra)

# Contributions of rows to dimension 1
s1 <- fviz_contrib(mca, choice = "var", axes = 1, top = 20)
# Contributions of rows to dimension 2
s2 <- fviz_contrib(mca, choice = "var", axes = 2, top = 20)

# combine the two plots
library(patchwork)
s1 | s2

# The total contributions to dimensions 1 and 2 are obtained as follows:
fviz_contrib(mca, choice = "var", axes = 1:2, top = 20)

# color by contribution
fviz_mca_var(mca, col.var = "contrib",
             gradient.cols = c("springgreen1", "springgreen3", "darkgreen"), 
             repel = TRUE, # avoid text overlapping (slow)
             ggtheme = theme_minimal()
             )

I think this last plot illustrates the contributions quite well.